from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive/Shared\ drives/BA775\ -\ Team\ 4A
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
#plt.style.use('classic')
%matplotlib inline
import seaborn as sns
AirBnB has become one of the most popular and fastest-growing travel lodging providers. In 2020, an estimated 14,000 new hosts joined the platform each month, all of them placing new listings up for rent.
Despite their growing popularity, AirBnB has not provided any specific guidance on how hosts should price their listing, providing only a suggestion that hosts should "search for comparable listings in your city or neighborhood to get an idea of market prices" on one of their FAQ pages.
This project aims to take a first step in providing clearer metrics to determine AirBnB listing prices.
Our goal is to identify factors that determine AirBnB listing prices in New York City by conducting the following tasks:
Ultimately, we hope to help AirBnB renters determine the best price to list for their rentals and identify areas of potential improvement in order to increase marketability of their listings.
NOTE FOR OUR READERS: To more easily navigate this notebook, keep an eye out for keywords like "takeaways", "conclusion", and "summary". Almost every section and subsection has a summary or conclusion from which readers can derive key insights. For example, each subsection in our EDA begins with a 'Section Summary'.
This project's primary dataset will be from InsideAirBnB.com, a third-party website that regularly scrapes data from the AirBnB website. The particular dataset we will work with consists of information on AirBnB listings in New York City, scraped in October 2020. http://insideairbnb.com/get-the-data.html
Unprocessed, the dataset had 44,666 data points and 77 columns.
To support our analysis and model, we will be drawing insights from external datasets:
raw = pd.read_csv('listings.csv')
raw.head()
raw.shape
raw.info()
Checking for duplicates in this dataset is nuanced. A data point could be a true duplicate, or it could be a listing identical in description but actually a different listing (e.g. a host with two very similar listings that are both rentable). This means that some manual checking and personal judgement may be required to determine whether a given row is a duplicate.
raw.columns
dup = raw.duplicated(subset=['id'], keep=False)
raw[dup].shape
It seems there are no duplicate entries in our dataset based on the ID of the listing. With that said, there may be duplicate listings under the same conditions.
Let's see if there are any listings with the same name, location, and host:
dup = raw.duplicated(subset=['name','latitude','longitude','host_id'], keep=False)
raw[dup].shape
As seen above, there do seem to be a number of duplicates based on the listing name, location, and host. The question is now whether or not we should deal with these duplicates.
The issue is these duplicate rows may be...
Ultimately, since there are no data input / scrape errors (shown by unique listing IDs), we will not tamper with these 'duplicates'.
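If we did later decide to collapse these near-duplicates instead of keeping them, pandas' `drop_duplicates` would do it directly; a minimal sketch on hypothetical toy rows mirroring the columns used in the duplicate check above:

```python
import pandas as pd

# Toy frame mimicking the listing columns used in the duplicate check
# (hypothetical values for illustration only).
toy = pd.DataFrame({
    'id':        [1, 2, 3],
    'name':      ['Cozy loft', 'Cozy loft', 'Sunny room'],
    'latitude':  [40.71, 40.71, 40.73],
    'longitude': [-73.99, -73.99, -73.98],
    'host_id':   [10, 10, 11],
})

# Collapse same-name / same-location / same-host rows, keeping the first.
deduped = toy.drop_duplicates(subset=['name', 'latitude', 'longitude', 'host_id'],
                              keep='first')
print(deduped.shape)  # one of the two 'Cozy loft' rows is dropped
```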
Since we're utilizing a dataset scraped directly from AirBnB's website, it contains far more variables than our first dataset from Kaggle, whose author appears to have already cleaned it and excluded irrelevant columns.
We will now do the same with the variables in this dataset that we won't be using for this project. Most of the columns we will drop relate to the host, web scraping, and listing information that is unlikely to affect price (e.g. host url, scrape id, host location, etc.).
cols_to_drop = ['host_url','calendar_updated','listing_url','scrape_id','last_scraped','neighborhood_overview','picture_url','host_name','host_since','host_response_time','host_location','host_about','host_thumbnail_url','host_picture_url','host_neighbourhood','host_listings_count','host_verifications','neighbourhood','bathrooms','calendar_last_scraped','first_review','last_review','license','calculated_host_listings_count','calculated_host_listings_count_entire_homes','calculated_host_listings_count_private_rooms','calculated_host_listings_count_shared_rooms']
clean_droppedcols = raw.drop(columns=cols_to_drop)
display(f'Previously: {raw.shape}', f'Now: {clean_droppedcols.shape}')
Amenities_Count column
With the amenities of each listing enumerated, it might be interesting to see whether the number of amenities affects the listing's price.
clean_droppedcols['amenities'][0]
len(clean_droppedcols['amenities'][0].strip("][").split('", "'))
amenities_count = []
for i in range(len(clean_droppedcols['amenities'])):
    count = len(clean_droppedcols['amenities'][i].strip("][").split('", "'))
    amenities_count.append(count)
clean_droppedcols['amenities_count'] = amenities_count
clean_droppedcols[['amenities_count','amenities']].head()
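For reference, the loop above can also be written with vectorized string methods, or more robustly by parsing the JSON-style amenity strings directly; a sketch on toy values in the same format:

```python
import json
import pandas as pd

# Toy amenities strings in the same JSON-list format as the scraped column.
amenities = pd.Series(['["Wifi", "Kitchen", "Heating"]', '["Wifi"]'])

# Same result as the loop above, but vectorized:
count_split = amenities.str.strip('][').str.split('", "').str.len()

# A more robust option is to parse the JSON directly:
count_json = amenities.apply(lambda s: len(json.loads(s)))

print(count_split.tolist(), count_json.tolist())  # [3, 1] [3, 1]
```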
price variable
When taking a closer look at the price of the listings, it seems that the data type is a string, with a dollar sign affixed to the front and commas separating thousands. We need to convert this into a float so we can run analysis on it later on.
data = clean_droppedcols
type(data['price'][0])
raw['price'][0]
data['price'] = [float(data['price'][i].strip('$').replace(',','')) for i in range(len(data['price']))]
# note the change in type
type(data['price'][0])
raw['price'].describe()
data['price'].describe()
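An equivalent one-step vectorized conversion is also possible with pandas string methods; a sketch on toy values in the same `$1,234.00` format:

```python
import pandas as pd

# Toy price strings in the same format as the raw column.
price = pd.Series(['$150.00', '$1,200.00'])

# Strip '$' and ',' in one regex pass, then cast to float.
price_float = price.str.replace('[$,]', '', regex=True).astype(float)
print(price_float.tolist())  # [150.0, 1200.0]
```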
Something notable from the summary statistics above is the existence of zero-value prices, which should not be the case. Let's see how many there are.
price0 = data['price']==0
price0_count = data[price0]['price'].count()
price0_pct = data[price0]['price'].count() / data['price'].count()*100
print(price0_count, price0_pct)
With only 25 listings (0.055% of the whole dataset) being priced at 0, we can either fix them or remove them entirely. After some investigation into why these listings are priced at 0, a few possible explanations have come up:
Since we do not know why some of these listings are priced at 0 and considering how few of them there are, we will remove these listings from our dataset.
data = data[~price0]
data['price'].min()
bathroom variable
data['bathrooms_text'].unique()
In the raw dataset, information on the listings' bathrooms has been read as an object type. This is likely due to the fact that there are a combination of letters and numbers due to the inclusion of the bathroom descriptions (i.e. shared vs private).
Therefore, we will now create 2 new columns (bath_count and bath_type) so we can run analysis on this feature.
# dealing with null values, since NaN cannot be handled by the list comprehension below
bath = data['bathrooms_text']
bath = bath.fillna('0')
bath
# creating a list from the Pandas series for the list comprehension
bath_list=[b.lower() for b in bath]
# sorting the bathroom types and saving into a new list
bath_types = []
for b in bath_list:
    if '0' in b:
        bath_types.append('NA')
    elif 'shared' in b:
        bath_types.append('shared')
    elif 'private' in b:
        bath_types.append('private')
    else:
        bath_types.append('personal')
# creating a new column based on the previous for & if conditions
data['bath_type'] = bath_types
# splitting bathroom_text because the first word describes the count of bathrooms
bath_count = data.bathrooms_text.str.split(" ",expand=True,).drop(columns=[1,2])
# identifying half bathrooms manually since they are not identified numerically (e.g. Half-bath, Private half-bath, Shared half-bath)
bath_count.replace(['Half-bath','Private','Shared'],0.5,inplace=True)
# converting the numeric strings into a float
bath_count = bath_count.astype(float)
data['bath_count'] = bath_count
cols = ['bathrooms_text','bath_count','bath_type']
data[cols].head(10)
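As an aside, a single regex extraction could recover the same counts without the split-and-replace steps; a sketch on toy values (real data may contain formats not covered here):

```python
import pandas as pd

# Toy bathrooms_text values covering numeric and half-bath formats.
bath = pd.Series(['1 bath', '2.5 baths', 'Shared half-bath', 'Private half-bath'])

# Pull the leading number where present; rows with no leading number
# (the half-bath variants) fall back to 0.5.
count = bath.str.extract(r'^(\d+\.?\d*)')[0].astype(float).fillna(0.5)
print(count.tolist())  # [1.0, 2.5, 0.5, 0.5]
```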
SUCCESS!
data.info()
raw['host_response_rate']
Taking a look at the information on our dataset, the first occurrence of null values is in the name and description columns.
data[data['description'].isna()][['price','number_of_reviews']].head(7)
data[data['name'].isna()][['price','number_of_reviews']].head(7)
Considering that this project does not plan to use description or name in our exploration, it may be worth leaving these columns as-is for the moment (perhaps even dropping them entirely later on).
The next few columns with null values pertain to the host, specifically their interaction with customers. We will deal with the columns that should be numeric but are currently objects.
data[data['host_response_rate'].isna()][['price','reviews_per_month']].head()
data[data['host_acceptance_rate'].isna()][['price','reviews_per_month']].head()
The queries above show that despite null values, the listings do seem to be active.
Information on the AirBnB page shows that these values are based on the last 30 days, so if a host has not interacted with customers in the past month, they will have null values. For this project, we will fill the null values with the mean values.
host_resp_rate = data['host_response_rate'].fillna('-1%')
type(host_resp_rate[0])
int_host_resp_rate = [int(rate.strip('%')) for rate in host_resp_rate]
data['host_response_rate'] = int_host_resp_rate
data['host_response_rate'].replace({-1:np.nan}, inplace=True)
host_acc_rate = data['host_acceptance_rate'].fillna('-1%')
int_host_acc_rate = [int(rate.strip('%')) for rate in host_acc_rate]
data['host_acceptance_rate'] = int_host_acc_rate
data['host_acceptance_rate'].replace({-1:np.nan}, inplace=True)
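With the rates now numeric, the mean-fill described above can be done with `fillna`; a minimal sketch on a toy column:

```python
import pandas as pd

# Toy numeric response rates with a missing value.
rates = pd.Series([100.0, 80.0, None])

# Fill nulls with the column mean, as described above.
filled = rates.fillna(rates.mean())
print(filled.tolist())  # [100.0, 80.0, 90.0]
```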
For the sake of clarity, we will rename two columns in our dataset. This will be helpful for EDA and when we join additional datasets later on.
data.rename(columns={"neighbourhood_cleansed": "neighbourhood", "neighbourhood_group_cleansed": "borough"},inplace=True)
To collaborate more easily, we worked in separate notebooks, which required us to export the dataset to a CSV.
# data.to_csv('cleaned_listings.csv',index=False)
cd supplemental\ datasets
stations = pd.read_csv('http://web.mta.info/developers/data/nyct/subway/Stations.csv')
stations.head(3)
stations.info()
stations['Borough'].unique()
Since the borough information is saved as a code, we will have to convert them manually to their full names:
stations['Borough'].replace({'Q':'Queens','M':'Manhattan','Bk':'Brooklyn','Bx':'Bronx','SI':'Staten Island'}, inplace=True)
stations['Borough'].unique()
Aggregating the number of stations in a given borough:
stations_clean = stations.groupby('Borough').size().reset_index(name='Station_count')
stations_clean
Creating another merge with NYC population information so we can control for population size:
pop = pd.read_csv('https://data.cityofnewyork.us/resource/swpk-hqdp.csv')
pop.head(3)
To avoid double-counting, we will only pull information from one year. The most recent census was conducted in 2010, so we will use the 2010 figures.
pop = pop[pop['year']==2010]
pop_clean = pop.groupby('borough').sum()['population'].reset_index()
stations_final = stations_clean.merge(pop_clean, left_on='Borough', right_on='borough').drop(columns='borough')
stations_final
When adding land-size information to the dataset to control for land size, it was easier to source each borough's land area individually than to import a dataset, so we manually input the figures from census.gov, especially considering that land size does not change significantly across decades.
stations_final['size(sq-mi)'] = [42.10, 70.82, 22.83, 108.53, 58.37]
stations_final['stations_per_capita'] = stations_final['Station_count']/stations_final['population']*1000000
stations_final['stations_per_sq-mi'] = stations_final['Station_count']/stations_final['size(sq-mi)']
stations_final
# stations_final.to_csv('cleaned_stations.csv', index=False)
property = pd.read_csv('nyc-property-sales.csv')
property.shape
property.info()
We plan to extract the following information from the dataset:
All other variables and columns are irrelevant during this time.
cols = ['BOROUGH', 'NEIGHBORHOOD','SALE PRICE','GROSS SQUARE FEET']
property = property[cols]
property.head()
In this dataset, boroughs have been coded into numbers, as described in the Kaggle page:
"BOROUGH: A digit code for the borough the property is located in; in order these are Manhattan (1), Bronx (2), Brooklyn (3), Queens (4), and Staten Island (5)."
We will now convert this code to the appropriate name.
property['BOROUGH'].replace({1:'Manhattan',\
2:'Bronx',\
3:'Brooklyn',\
4:'Queens',\
5:'Staten Island'}, inplace=True)
It also seems that sale price is an object, as opposed to our expected numeric type; this is likely because missing values are recorded as a "-" placeholder.
property = property[~property['SALE PRICE'].str.contains("-")]
property['SALE PRICE'] = property['SALE PRICE'].astype(float)
We will now do the same for Gross Square Feet
property = property[~property['GROSS SQUARE FEET'].str.contains("-")]
property['GROSS SQUARE FEET'] = property['GROSS SQUARE FEET'].astype(float)
property
round(property.groupby('BOROUGH').agg(['count','mean','std']),2)
property_boro = round(property.groupby('BOROUGH').mean().reset_index(),2)
# adding property price per sqft, controlling property prices for size
property_boro['property_price_per_sqft'] = property_boro['SALE PRICE']/property_boro['GROSS SQUARE FEET']
# renaming columns
col_names = ['borough','property_price','gross_sq_ft','property_price_per_sqft']
property_boro.columns = col_names
property_boro
# property_boro.to_csv('property_sales_by_boro.csv', index=False)
Subway Stations Merge
stations = pd.read_csv('cleaned_stations.csv')
stations
data = data.merge(stations, left_on='borough',right_on='Borough')\
.drop(columns=['Borough','population','size(sq-mi)'])\
.rename(columns={'stations_per_capita':'stations_per_capita_in_boro','stations_per_sq-mi':'stations_per_sq-mi_in_boro'})
data.columns[-3:]
Property Sales Merge
property = pd.read_csv('property_sales_by_boro.csv')
property
data = data.merge(property, on='borough').drop(columns='gross_sq_ft')
data.columns[-2:]
# cd ..
# data.to_csv('merged_dataset.csv', index=False)
# cd 'supplemental datasets'/
df = data
df.info()
df.head(10)
Upon exploration, there seem to be a lot of outliers in the listing prices. In order to explore the data accurately without outliers skewing the findings, we decided to split the dataset: one table consists of listings priced at or below 1,000 USD, the other of listings above 1,000 USD.
Let's briefly look at the distribution of prices across all the listings.
df['price'].describe()
plt.figure(figsize=(30, 6),facecolor='w')
plt.xticks(np.arange(0,10000, step=500))
sns.violinplot(x='price', data=df)
plt.figure(figsize=(30, 6),facecolor='w')
plt.xticks(np.arange(0,10000))
plt.xscale('log')
sns.boxplot(x='price', data=df, whis=1)
price_filter_above1000 = df['price'] > 1000
pct_above1000 = df[price_filter_above1000]['price'].count()/df['price'].count()*100
print(f'{round(pct_above1000,3)}% of the listings in our dataset are priced above $1000')
With an overwhelming majority of the listings in our dataset under $1000 (over 99%), it is unwise to conduct our exploration on the entirety of the dataset as these high-priced outliers may heavily skew our exploration. On the other hand, we don't want to exclude these high-priced listings in our ML model later on so we do need to conduct EDA on them.
With these two conditions in mind, we have decided to run two EDAs; one for low/med-priced listings and another for high-priced listings (> $1000).
Our hope is that running these two EDAs will allow us to develop two ML models, one for low-priced listings and another for high-priced listings.
df1 = df[~price_filter_above1000]
df1['price'].describe()
df2 = df[price_filter_above1000]
df2['price'].describe()
Prior to digging deeper into other variables, let's briefly take a look at the correlation between price and all the numerical columns.
With a large number of variables, let's take a look at the five most correlated (note that we are using absolute values, as we only want to see how strongly each variable correlates with price; we are not concerned with direction at this time).
table = abs(df1.corr()['price'].reset_index().rename(columns={'index':'variable','price':'correlation with AirBnB price'}).set_index('variable').drop(index='price')).sort_values('correlation with AirBnB price', ascending=False)
table.head(5)
fig = sns.barplot(data=table.reset_index().head(5), y='correlation with AirBnB price',x='variable')
fig.set_xticklabels(fig.get_xticklabels(), rotation=30, horizontalalignment='right')
sns.set(rc={'figure.figsize':(10,8)})
For low-priced listings, the top three most correlated variables are the number of people the listing accommodates, the number of bedrooms, and the number of beds.
table = abs(df2.corr()['price'].reset_index().rename(columns={'index':'variable','price':'correlation with AirBnB price'}).set_index('variable').drop(index='price')).sort_values('correlation with AirBnB price', ascending=False)
table.head(5)
fig = sns.barplot(data=table.reset_index().head(5), y='correlation with AirBnB price',x='variable')
fig.set_xticklabels(fig.get_xticklabels(), rotation=30, horizontalalignment='right')
sns.set(rc={'figure.figsize':(10,8)})
For high-priced listings, the top three most correlated variables are the number of bedrooms, availability_365, and the number of amenities. With that said, these values are low, so there may not be any significant relationship present.
While we did briefly look at price variations between neighborhoods, this project currently focuses on price variations at the borough level. Even at this level, we see significant price variations across the boroughs at both price levels.
First, it seems that most listings are located in Manhattan, followed by Brooklyn, then Queens. This is the case across both high and low-priced listings.
It should be noted that most of the data points for listings above $1000 are located in Manhattan, with only one high-priced listing each in Staten Island and the Bronx. The sample size is likely too small to reach any concrete conclusions for high-priced listings.
# Initializing color scheme for boroughs
boro_colors = {'Manhattan': "#CC1433", 'Brooklyn': "#5998C5", 'Queens': "#8E7DBE", 'Bronx':'#ffbd00', 'Staten Island': '#49802B'}
This section will cover initial EDA on AirBnB listings priced under $1000, so we will be using df1 as separated in the Price EDA section.
Let's take a look at what we're working with:
cols = ['borough','neighbourhood']
df1[cols].describe()
It seems that Manhattan is the borough with the most listings (over 19,500), while Williamsburg, Brooklyn is the neighbourhood with the most listings (~3,000).
Let's see if we can visualize these insights:
order = ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']
sns.set(rc={'figure.figsize':(10,8)})
sns.countplot(data=df1,y='borough',order=order, palette=boro_colors)
plt.title('Number of AirBnB Listings by NYC Borough priced below $1000', fontdict={'fontsize':14} ,pad=10)
plt.ylabel('Borough')
plt.xlabel('Number of Listings')
Since our project will try to predict prices, we will look into the potential relationship between the location, at the borough-level, and the price of the listings.
Let's take a look at the descriptive statistics:
df1.groupby('borough')['price'].describe().sort_values('mean',ascending=False)
Quick Insights:
sns.set(rc={'figure.figsize':(10,15)})
sns.catplot(data=df1, y='price',x='borough',alpha=0.2, aspect=2, palette=boro_colors)
plt.ylabel('Price [$]')
plt.xlabel('Borough')
sns.set(rc={'figure.figsize':(10,8)})
sns.violinplot(y = 'price',data=df1, x='borough', palette=boro_colors)
plt.ylabel('Price [$]')
plt.xlabel('Borough')
fig = df1.groupby('borough')['price'].agg({'mean'}).reset_index().sort_values('mean', ascending=False)
sns.set(rc={'figure.figsize':(10,8)})
sns.barplot(y='mean', x='borough', ci=None, data=fig, palette=boro_colors) #,kind='bar')
plt.ylabel('Average Price [$]')
plt.xlabel('Borough')
# getting a list of boroughs
boro = list(df1['borough'].unique())
# displaying most expensive neighborhoods in each borough
for b in boro:
display(df1[df1['borough']==b].pivot_table(values='price', index=['borough','neighbourhood'],\
aggfunc='mean').sort_values('price', ascending=False).head(3))
Next, let's conduct the same exploration for higher-priced listings. For this we will be using df2 as our dataset:
cols = ['borough','neighbourhood']
df2[cols].describe()
It seems that Manhattan is still the borough with the most listings in this case. Midtown, Manhattan has overtaken Williamsburg as the neighbourhood with the most listings.
sns.countplot(data=df2, y='borough',order=order,palette=boro_colors)
plt.title('Number of AirBnB Listings by NYC Borough priced above $1000', fontdict={'fontsize':14} ,pad=10)
plt.ylabel('Borough')
plt.xlabel('Number of Listings')
Like before, let's explore the relationship between these two variables for expensive listings.
df2.groupby('borough')['price'].describe().sort_values('mean',ascending=False)
Quick Insights:
sns.set(rc={'figure.figsize':(10,15)})
sns.catplot(data=df2, y='price',x='borough',alpha=0.5, aspect=1.5, palette=boro_colors)
plt.ylabel('Price [$]')
plt.xlabel('Borough')
sns.set(rc={'figure.figsize':(10,8)})
sns.violinplot(y = 'price',data=df2, x='borough', palette=boro_colors)
plt.ylabel('Price [$]')
plt.xlabel('Borough')
The two graphs show that the distributions vary across boroughs. Additionally, with very few data points in Queens, the price distribution there appears less concentrated than in Manhattan or Brooklyn.
It's also worth noting that Manhattan has a few listings hitting the maximum, which seems to be capped at $10,000 (likely by AirBnB).
fig = df2.groupby('borough')['price'].agg({'mean'}).reset_index().sort_values('mean', ascending=False)
sns.set(rc={'figure.figsize':(10,8)})
sns.barplot(y='mean', x='borough', ci=None, data=fig, palette=boro_colors) #,kind='bar')
plt.ylabel('Average Price [$]')
plt.xlabel('Borough')
# getting a list of boroughs
boro = list(df2['borough'].unique())
# displaying most expensive neighborhoods in each borough
for b in boro:
display(df2[df2['borough']==b].pivot_table(values='price', index=['borough','neighbourhood'],\
aggfunc='mean').sort_values('price', ascending=False).head(3))
Similar to our BA775 project, we will explore why there are variations in prices among the locations.
For this project, we will continue to look at:
We will exclude exploration around crime data since our previous project showed little to no correlation.
As seen below:
df1.head()
cols = ['Station_count', 'stations_per_capita_in_boro','stations_per_sq-mi_in_boro']
df1.groupby('borough')[cols].mean().sort_values('stations_per_capita_in_boro',ascending=False)
g = df1.groupby('borough')[cols].mean().sort_values('stations_per_capita_in_boro',ascending=False).reset_index()
fig, ax = plt.subplots(1,3)
sns.set(rc={'figure.figsize':(22,10)}) # Figure size
count = sns.barplot(y='Station_count',x='borough',data=g, ax=ax[0], palette=boro_colors)
count.set(ylabel='Number of Stations', xlabel = 'NYC Borough', title='Total Count')
count_per_cap = sns.barplot(y='stations_per_capita_in_boro',x='borough',data=g, ax=ax[1], palette=boro_colors)
count_per_cap.set(ylabel='Number of Stations per 1 million people', xlabel = 'NYC Borough', title='Controlling for Population')
count_per_sqmi = sns.barplot(y='stations_per_sq-mi_in_boro',x='borough',data=g, ax=ax[2], palette=boro_colors)
count_per_sqmi.set(ylabel='Number of Stations per square-mile', xlabel = 'NYC Borough', title='Controlling for Land Size')
sns.set(rc={'figure.figsize':(22,10)}) # Figure size
cols = ['Station_count', 'stations_per_capita_in_boro',
'stations_per_sq-mi_in_boro','price','borough']
listing_count = df1.groupby('borough')['price'].count().reset_index().rename(columns={'price':'Listing Count'})
g = df1[cols].groupby('borough').mean().reset_index().merge(listing_count, on='borough')
g
fig, ax = plt.subplots(1,3)
sns.set(rc={'figure.figsize':(15,10)}) # Figure size
sns.scatterplot(x='Station_count',y='price',data=g, ax=ax[0], hue='borough',size='Listing Count',sizes=(20,300),legend='brief', palette=boro_colors)
sns.scatterplot(x='stations_per_capita_in_boro',y='price',data=g, ax=ax[1], hue= 'borough',size='Listing Count',sizes=(20,300),legend='brief', palette=boro_colors)
sns.scatterplot(x='stations_per_sq-mi_in_boro',y='price',data=g, ax=ax[2], hue='borough',size='Listing Count',sizes=(20,300),legend='brief', palette=boro_colors)
It seems that there is a correlation between the average price of listings and the number of subway stations (especially when controlling for population or land size).
Let's see what the correlation coefficients are and what that might look like in a regression plot:
g.corr()['price'].reset_index().rename(columns={'index':'variable','price':'correlation with price'}).\
set_index('variable').drop(index='price')
fig, ax = plt.subplots(1,3)
sns.set(rc={'figure.figsize':(22,10)}) # Figure size
sns.regplot(x='Station_count',y='price',data=g, ax=ax[0],)
sns.regplot(x='stations_per_capita_in_boro',y='price',data=g, ax=ax[1])
sns.regplot(x='stations_per_sq-mi_in_boro',y='price',data=g, ax=ax[2])
cols = ['Station_count', 'stations_per_capita_in_boro',
'stations_per_sq-mi_in_boro','price','borough']
listing_count = df2.groupby('borough')['price'].count().reset_index().rename(columns={'price':'Listing Count'})
g = df2[cols].groupby('borough').mean().reset_index().merge(listing_count, on='borough')
g
fig, ax = plt.subplots(1,3)
sns.set(rc={'figure.figsize':(15,10)}) # Figure size
sns.scatterplot(x='Station_count',y='price',data=g, ax=ax[0], hue='borough',size='Listing Count',sizes=(20,300),legend='brief', palette=boro_colors)
sns.scatterplot(x='stations_per_capita_in_boro',y='price',data=g, ax=ax[1], hue= 'borough',size='Listing Count',sizes=(20,300),legend='brief', palette=boro_colors)
sns.scatterplot(x='stations_per_sq-mi_in_boro',y='price',data=g, ax=ax[2], hue='borough',size='Listing Count',sizes=(20,300),legend='brief', palette=boro_colors)
The scatter plots show that the correlation has disappeared. Let's confirm that with a correlation coefficient table and regression plot.
g.corr()['price'].reset_index().rename(columns={'index':'variable','price':'correlation with price'}).\
set_index('variable').drop(index='price')
fig, ax = plt.subplots(1,3)
sns.set(rc={'figure.figsize':(22,10)}) # Figure size
sns.regplot(x='Station_count',y='price',data=g, ax=ax[0],)
sns.regplot(x='stations_per_capita_in_boro',y='price',data=g, ax=ax[1])
sns.regplot(x='stations_per_sq-mi_in_boro',y='price',data=g, ax=ax[2])
Our suspicion was correct: there is a significant decrease in correlation. It seems that the expensive listings in Queens are throwing off our correlations. Out of curiosity, let's see what happens when we filter Queens out.
new_g = g.set_index('borough').drop(index='Queens').reset_index()
new_g
new_g.corr()['price'].reset_index().rename(columns={'index':'variable','price':'correlation with price'}).\
set_index('variable').drop(index='price')
fig, ax = plt.subplots(1,3)
sns.set(rc={'figure.figsize':(22,10)}) # Figure size
sns.regplot(x='Station_count',y='price',data=new_g, ax=ax[0],)
sns.regplot(x='stations_per_capita_in_boro',y='price',data=new_g, ax=ax[1])
sns.regplot(x='stations_per_sq-mi_in_boro',y='price',data=new_g, ax=ax[2])
Our suspicion that Queens was throwing off the correlation was correct. We may be tempted to remove Queens as a data point in our exploration, but if we do, we won't be able to include expensive listings located in Queens in future iterations of our model.
Therefore, we have to conclude that either...
We suspect it is the latter, because the type of people who rent listings priced above $1000 a day are unlikely to take the subway. We may need to utilize an alternative variable to measure transportation accessibility.
Ultimately, we will opt to exclude this variable for our model for expensive listings.
We will now look at how property sale prices affect the average listing prices across the five boroughs.
Source: https://www.kaggle.com/new-york-city/nyc-property-sales
cols = ['property_price', 'property_price_per_sqft']
round(df1.groupby('borough')[cols].mean().sort_values('property_price',ascending=False),2)
g = round(df1.groupby('borough')[cols].mean().sort_values('property_price',ascending=False).reset_index(),2)
fig, ax = plt.subplots(1,2)
sns.set(rc={'figure.figsize':(22,10)}) # Figure size
count = sns.barplot(y='property_price',x='borough',data=g, ax=ax[0], palette=boro_colors)
count.set(ylabel='Average Property Price [Million $]', xlabel = 'NYC Borough', title='Average Price')
count_per_cap = sns.barplot(y='property_price_per_sqft',x='borough',data=g, ax=ax[1], palette=boro_colors)
count_per_cap.set(ylabel='Average Property Price per sqft [$]', xlabel = 'NYC Borough', title='Controlling for Size')
cols = ['property_price', 'property_price_per_sqft','price','borough']
listing_count = df1.groupby('borough')['price'].count().reset_index().rename(columns={'price':'Listing Count'})
g = round(df1[cols].groupby('borough').mean().reset_index().merge(listing_count, on='borough'),2)
g
fig, ax = plt.subplots(1,2)
sns.set(rc={'figure.figsize':(15,10)}) # Figure size
A = sns.scatterplot(x='property_price',y='price',data=g, ax=ax[0], hue='borough',size='Listing Count',sizes=(20,300),legend='brief', palette=boro_colors)
B = sns.scatterplot(x='property_price_per_sqft',y='price',data=g, ax=ax[1], hue= 'borough',size='Listing Count',sizes=(20,300),legend='brief', palette=boro_colors)
A.set(ylabel='Average AirBnB Listing Price[$ per night]', xlabel = 'Average Property Value [Million $]')
B.set(ylabel='Average AirBnB Listing Price[$ per night]', xlabel = 'Average Property Value per sqft [$]')
It seems that there is a correlation between the average price of listings and the average property values in the boroughs.
Let's see what the correlation coefficients are and what that might look like in a regression plot:
g.corr()['price'].reset_index().rename(columns={'index':'variable','price':'correlation with AirBnB price'}).\
set_index('variable').drop(index=['price','Listing Count'])
fig, ax = plt.subplots(1,2)
sns.set(rc={'figure.figsize':(22,10)}) # Figure size
A =sns.regplot(x='property_price',y='price',data=g, ax=ax[0])
B = sns.regplot(x='property_price_per_sqft',y='price',data=g, ax=ax[1])
A.set(ylabel='Average AirBnB Listing Price [$ per night]', xlabel = 'Average Property Value [Million $]')
B.set(ylabel='Average AirBnB Listing Price [$ per night]', xlabel = 'Average Property Value per sqft [$]')
While it makes sense that a correlation is present between these variables, it's interesting to note that average property values correlate more strongly with listing prices than property values per sqft (which control for size).
cols = ['property_price', 'property_price_per_sqft','price','borough']
listing_count = df2.groupby('borough')['price'].count().reset_index().rename(columns={'price':'Listing Count'})
g = round(df2[cols].groupby('borough').mean().reset_index().merge(listing_count, on='borough'),2)
g
fig, ax = plt.subplots(1,2)
sns.set(rc={'figure.figsize':(15,10)}) # Figure size
A =sns.scatterplot(x='property_price',y='price',data=g, ax=ax[0], hue='borough',size='Listing Count',sizes=(20,300),legend='brief', palette=boro_colors)
B =sns.scatterplot(x='property_price_per_sqft',y='price',data=g, ax=ax[1], hue= 'borough',size='Listing Count',sizes=(20,300),legend='brief', palette=boro_colors)
A.set(ylabel='Average AirBnB Listing Price [$ per night]', xlabel = 'Average Property Value [Million $]')
B.set(ylabel='Average AirBnB Listing Price [$ per night]', xlabel = 'Average Property Value per sqft [$]')
It seems that there is a correlation between the average price of listings and the average property values (controlled for size).
Let's see what the correlation coefficients are and what that might look like in a regression plot:
g.corr()['price'].reset_index().rename(columns={'index':'variable','price':'correlation with AirBnB price'}).\
set_index('variable').drop(index=['price','Listing Count'])
fig, ax = plt.subplots(1,2)
sns.set(rc={'figure.figsize':(22,10)}) # Figure size
A = sns.regplot(x='property_price',y='price',data=g, ax=ax[0])
B = sns.regplot(x='property_price_per_sqft',y='price',data=g, ax=ax[1])
A.set(ylabel='Average AirBnB Listing Price [$ per night]', xlabel = 'Average Property Value [Million $]')
B.set(ylabel='Average AirBnB Listing Price [$ per night]', xlabel = 'Average Property Value per sqft [$]')
So far it seems that the average property value per sqft has been the best explanation for the price variations across the boroughs for high-priced AirBnB listings.
Do characteristics of the host have any relation to the price? We could hypothesize that higher quality hosts (i.e. superhosts with high response rates) could charge higher prices for their listings, but let's see if that's the case.
Based on our EDA, there doesn't seem to be much correlation between host characteristics and listing price, for both low and high-priced listings.
For both groups, hosts tend to have profile pictures and verified identities, and most hosts are not superhosts. These categorical variables show no correlation with listing price.
The same pattern holds for the numerical variables: host response and acceptance rates tend to be high, while host listing counts tend to be low. Again, no correlation with listing price is apparent.
Considering the low correlation values of these host variables, we will not include them in the predictive ML model.
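The decision above (dropping features whose correlation with price is low) could be automated with a small helper. This is only a sketch: the `weak_features` function, the 0.1 threshold, and the toy data are illustrative assumptions, not the project's actual pipeline.

```python
import pandas as pd

def weak_features(df, target='price', threshold=0.1):
    """Return numeric columns whose |correlation| with the target
    falls below the threshold (candidates for exclusion)."""
    corr = df.corr(numeric_only=True)[target].drop(target).abs()
    return corr[corr < threshold].index.tolist()

# Toy data: 'noise' is unrelated to price, 'accommodates' tracks it closely
toy = pd.DataFrame({
    'price':        [100, 120, 140, 160, 180, 200],
    'accommodates': [2, 2, 3, 4, 4, 5],
    'noise':        [7, 1, 9, 2, 8, 3],
})
print(weak_features(toy))
```

A correlation threshold only catches linear relationships, so it is a screening step rather than a final verdict on a feature's usefulness.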
Areas of Improvement:
host_columns = ['host_id', 'host_response_rate', 'host_acceptance_rate','host_is_superhost', 'host_total_listings_count','host_has_profile_pic', 'host_identity_verified']
TF = {0: "#CC1433", 1: '#49802B'}
There are two types of host features: numerical and categorical.
Let's take a look at how the categorical factors play a role.
host_cat = ['host_is_superhost','host_has_profile_pic', 'host_identity_verified']
df1[host_cat].describe()
It looks like hosts of low-priced listings tend to...
Since we want to look at this information at a deeper level, let's convert these values to dummy values.
df1[host_cat] = df1[host_cat].replace(['t','f'],[1,0])
df1[host_cat].describe()
This gives us a clearer picture of the distribution: it looks like 18% of hosts are superhosts, 99.7% have profile pictures, and 77% have verified their identities.
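The reason describe() now reads as percentages is that the mean of a 0/1 column equals the share of 1s. A minimal sketch of the t/f conversion on toy values:

```python
import pandas as pd

hosts = pd.DataFrame({'host_is_superhost': ['t', 'f', 'f', 't', 'f']})
hosts = hosts.replace(['t', 'f'], [1, 0])

# The mean of a 0/1 column is the share of 't' values (2 of 5 here)
print(hosts['host_is_superhost'].mean())
```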
Let's visualize some of these points:
fig, ax = plt.subplots(1,3)
A = sns.countplot(data=df1, x='host_is_superhost',ax=ax[0], palette=TF)
B = sns.countplot(data=df1, x='host_has_profile_pic',ax=ax[1], palette=TF)
C = sns.countplot(data=df1, x='host_identity_verified',ax=ax[2], palette=TF)
A.set_title('Is the host a superhost?')
B.set_title('Does the host have a profile picture?')
C.set_title("Is the host's identity verified?")
How do these variables relate to the price?
for var in host_cat:
    display(df1.groupby(var)['price'].describe())
fig, ax = plt.subplots(1,3)
sns.stripplot(data=df1, y='price' ,x='host_is_superhost', palette=TF, alpha=0.1, ax=ax[0])
sns.stripplot(data=df1, y='price' ,x='host_has_profile_pic', palette=TF, alpha=0.1, ax=ax[1])
sns.stripplot(data=df1, y='price' ,x='host_identity_verified', palette=TF, alpha=0.1, ax=ax[2])
Based on the descriptive statistics and visualization of the categorical variables against price, there seems to be no clear correlation between these variables and price.
First, nearly all hosts have a profile picture, so the sample of hosts without one is too small to support any comparison.
On average, superhosts tend to charge higher prices, but the difference is negligible compared to the standard deviation.
Similarly, prices do not differ in any statistically significant way between hosts with and without verified identities.
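The "difference is negligible relative to the spread" argument could be made more formal with a two-sample test. Below is a hand-rolled Welch t-statistic on toy prices; this is a sketch (for a proper p-value one would normally reach for scipy.stats.ttest_ind), and the sample values are invented for illustration.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

# Toy prices: superhosts vs. non-superhosts with heavy overlap
superhost = [120, 130, 110, 140, 125]
regular   = [115, 125, 105, 135, 120]
t = welch_t(superhost, regular)
# |t| well below ~2 suggests no significance at the 5% level
print(abs(t) < 2)
```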
Let's confirm this with a simple correlation matrix:
host_cat_price = host_cat + ['price']
df1[host_cat_price].corr().drop(columns=host_cat)
With significantly low correlation for all the categorical variables, we can conclude that there is no relationship between price and these variables for low-priced listings.
Let's take a look at how some of the continuous variables relating to the host are distributed and how they relate to listing prices.
host_num_col = ['host_response_rate', 'host_acceptance_rate', 'host_total_listings_count']
df1[host_num_col].describe()
Hosts of low-priced listings tend to have high average response and acceptance rates, though it is worth noting that the median rates are higher than the means, suggesting that a few low rates are dragging the averages down.
The opposite is true for host listing count, where most hosts have only 1 or 2 listings and yet the average is 15.
Let's visualize this:
fig, ax = plt.subplots(1,3)
sns.violinplot(data=df1,x='host_response_rate',ax=ax[0])
sns.violinplot(data=df1,x='host_acceptance_rate',ax=ax[1])
sns.violinplot(data=df1,x='host_total_listings_count',ax=ax[2])
Our hypothesis about host listing count was correct: a few hosts with a high number of listings drag the average up.
Host response and acceptance rates cluster at the high end, with a few 0-values. Further investigation may be required as to why the 0-values exist.
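A quick first step for that investigation could be counting the exact zeros per column before deciding whether they are genuine rates or placeholders for missing data. A sketch on toy values:

```python
import pandas as pd

rates = pd.DataFrame({
    'host_response_rate':   [1.0, 0.9, 0.0, 1.0, 0.0],
    'host_acceptance_rate': [0.8, 1.0, 0.0, 0.95, 0.7],
})

# Count exact zeros per column; a large count relative to the sample
# would suggest the zeros encode "no data" rather than a true 0% rate
print((rates == 0).sum().to_dict())
```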
Let's see how these variables relate to price:
fig, ax = plt.subplots(1,3)
sns.scatterplot(data=df1,x='host_response_rate',y='price',ax=ax[0], alpha=0.3)
sns.scatterplot(data=df1,x='host_acceptance_rate',y='price',ax=ax[1], alpha=0.3)
sns.scatterplot(data=df1,x='host_total_listings_count',y='price',ax=ax[2], alpha=0.3)
host_num_price = host_num_col + ['price']
df1[host_num_price].corr().drop(columns=host_num_col)
Based on the correlation values, there is very little correlation between price and the numerical variables for low-priced listings.
host_cat = ['host_is_superhost','host_has_profile_pic', 'host_identity_verified']
df2[host_cat].describe()
It looks like hosts of high-priced listings tend to...
This is the same for low-priced listings.
df2[host_cat] = df2[host_cat].replace(['t','f'],[1,0])
df2[host_cat].describe()
This gives us a clearer picture of the distribution: it looks like 12% of hosts are superhosts, 99.3% have profile pictures, and 73% have verified their identities. These figures are comparable to those for low-priced listings.
Let's visualize some of these points:
fig, ax = plt.subplots(1,3)
A = sns.countplot(data=df2, x='host_is_superhost',ax=ax[0], palette=TF)
B = sns.countplot(data=df2, x='host_has_profile_pic',ax=ax[1], palette=TF)
C = sns.countplot(data=df2, x='host_identity_verified',ax=ax[2], palette=TF)
A.set_title('Is the host a superhost?')
B.set_title('Does the host have a profile picture?')
C.set_title("Is the host's identity verified?")
How do these variables relate to the price?
for var in host_cat:
    display(df2.groupby(var)['price'].describe())
fig, ax = plt.subplots(1,3)
sns.stripplot(data=df2, y='price' ,x='host_is_superhost', palette=TF, alpha=0.5, ax=ax[0])
sns.stripplot(data=df2, y='price' ,x='host_has_profile_pic', palette=TF, alpha=0.5, ax=ax[1])
sns.stripplot(data=df2, y='price' ,x='host_identity_verified', palette=TF, alpha=0.5, ax=ax[2])
Based on the descriptive statistics and visualization of the categorical variables against price, there also seems to be no clear correlation for high-priced listings; the conclusions for low-priced listings apply here too.
Notably, the difference in average price between hosts with and without profile pictures may appear significant, but the very small sample of hosts without profile pictures means we cannot draw that conclusion.
Let's confirm this with a simple correlation matrix:
host_cat_price = host_cat + ['price']
df2[host_cat_price].corr().drop(columns=host_cat)
With significantly low correlation for all the categorical variables, we can conclude that there is no relationship between price and these variables for high-priced listings.
Let's take a look at how some of the continuous variables relating to the host are distributed and how they relate to listing prices.
host_num_col = ['host_response_rate', 'host_acceptance_rate', 'host_total_listings_count']
df2[host_num_col].describe()
fig, ax = plt.subplots(1,3)
sns.violinplot(data=df2,x='host_response_rate',ax=ax[0])
sns.violinplot(data=df2,x='host_acceptance_rate',ax=ax[1])
sns.violinplot(data=df2,x='host_total_listings_count',ax=ax[2])
The conclusions we reached about the distribution of these values for low-priced listings apply here as well: average and median host response and acceptance rates are high, but a few 0-values drag the averages down.
Additionally, most hosts have only 1-2 listings, while a few hosts with a large number of listings drag the average up.
Let's see how these variables relate to price:
fig, ax = plt.subplots(1,3)
sns.scatterplot(data=df2,x='host_response_rate',y='price',ax=ax[0], alpha=0.7)
sns.scatterplot(data=df2,x='host_acceptance_rate',y='price',ax=ax[1], alpha=0.7)
sns.scatterplot(data=df2,x='host_total_listings_count',y='price',ax=ax[2], alpha=0.7)
host_num_price = host_num_col + ['price']
df2[host_num_price].corr().drop(columns=host_num_col)
Based on the correlation values, there is very little correlation between price and the numerical variables for high-priced listings.
Our next section will explore how listing details could potentially affect price. Does the type of listing (i.e. hotel room vs. entire home) affect prices? What about the number of people it accommodates?
According to our EDA, most listings on AirBnB are entire homes and apartments, but other than that, our findings are varied between high and low priced listings.
columns =['accommodates', 'beds','room_type', 'bath_count','bedrooms','bath_type','price','amenities_count']
listings1 = df1[columns]
listings1.info()
There are two categorical variables present in our table: bath_type and room_type. bath_type is a column we created earlier to signify the type of bathroom the listing has, based on bathrooms_text, so we won't look into that variable until we explore bathrooms.
Let's see how room type is distributed across listings and how it relates to price.
listings1.groupby('room_type')['price'].count().sort_values(ascending=False).reset_index().rename(columns={'price':'count'})
With over 22,000 listings, it seems entire homes are the most popular type of listing.
Let's see what this looks like:
sns.countplot(data=listings1,x='room_type')
How do the room types relate to price?
listings1.groupby('room_type')['price'].describe()
fig, ax = plt.subplots(1,2)
sns.stripplot(data=listings1,x='room_type',y='price',alpha=0.2, ax=ax[0])
sns.violinplot(data=listings1,x='room_type',y='price', ax=ax[1])
Despite the low count of hotel rooms, they seem to be the most expensive on average. This suggests hotel rooms may bring in an external factor that cannot be captured with these variables (e.g. an elevated level of service like daily housekeeping).
Considering the goal of the project (i.e. helping regular hosts, not hotels, determine prices) and the limited number of hotel rooms, we may consider filtering hotel rooms out of our project.
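If we do decide to filter, the step is a one-line boolean mask. A sketch with toy rows (the room_type values follow the categories seen in the data):

```python
import pandas as pd

listings = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Hotel room', 'Private room', 'Hotel room'],
    'price': [150, 300, 80, 280],
})

# Keep everything except hotel rooms
no_hotels = listings[listings['room_type'] != 'Hotel room']
print(len(no_hotels))
```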
columns = ['beds', 'bath_type', 'bath_count', 'bedrooms']
listings1[columns].describe()
On average, listings tend to have only one bed, one bedroom, and one bathroom.
Let's dig a little deeper through visualization:
g = sns.PairGrid(df1, y_vars=["price"], x_vars=["bedrooms","beds", "bath_count"], height=4.5, hue="room_type", aspect=1.1)
ax = g.map(plt.scatter, alpha=0.2)
g.add_legend();
Quick insights:
Let's see what the correlation matrix looks like:
listings1.corr().sort_values('price', ascending=False)
It seems the variables with the highest correlation with price are the bedroom and bed counts.
Note that the number of people the listing accommodates has an even higher correlation with price, and the bedroom and bed counts are highly correlated with accommodates too. This suggests some multicollinearity, so if we decide to use one of these variables, we should include only one and not all three.
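One way to make the multicollinearity concern concrete is to inspect the pairwise correlations among the candidate predictors themselves, not just their correlations with price. A sketch on toy data where beds perfectly tracks accommodates:

```python
import pandas as pd

toy = pd.DataFrame({
    'accommodates': [2, 4, 6, 8, 10],
    'beds':         [1, 2, 3, 4, 5],   # perfectly tracks accommodates
    'bedrooms':     [1, 2, 2, 3, 4],
})

# Pairwise correlations among predictors; values near 1 flag redundancy
corr = toy.corr()
print(corr.loc['accommodates', 'beds'])
```

In practice one keeps the predictor with the strongest link to the target (here, accommodates) and drops the redundant ones.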
Let's see how the number of people the listing accomodates play a factor in price.
listings1['accommodates'].describe()
listings1.groupby('room_type')['accommodates'].describe()
Looking at the breakdown of the variable, it seems that the average listing accommodates 2-3 people. When broken down by room_type, entire homes/apartments tend to host more people (which was to be expected), followed by hotel rooms.
listings1[['accommodates','price']].corr()
a = sns.regplot(data=df1, y='price',x='accommodates')
a.set_xticks(range(0,20,2))
As seen before, there is some correlation between price and the number of people the listing accommodates, but it is not very strong, with a coefficient of only 0.48.
table = df1[['accommodates','price','room_type']].groupby('room_type').corr().reset_index().drop(columns='accommodates')
filter_out = table['level_1']=='price'
table[~filter_out].drop(columns='level_1').rename(columns={'price':'Correlation between price and accommodates'}).set_index('room_type')
Oddly enough, when controlling for room type, the correlation gets worse. Further exploration may be required to see why this is the case.
listings1['amenities_count'].describe()
sns.violinplot(data=listings1,x='amenities_count')
The distribution of amenities count seems to be centered between 10-30 amenities, with the average being ~19.
Does it have any correlation to price?
listings1[['amenities_count','price']].corr()
sns.scatterplot(data=listings1,y='price',x='amenities_count',hue='room_type',alpha=0.7)
Based on the correlation matrix and scatter plot, there does not seem to be any strong correlation present.
columns =['accommodates', 'beds','room_type', 'bath_count','bedrooms','bath_type','amenities_count','price']
listings2 = df2[columns]
listings2.info()
Let's see how room type is distributed across listings and how it relates to price.
listings2.groupby('room_type')['price'].count().sort_values(ascending=False).reset_index().rename(columns={'price':'count'})
Similar to low-priced listings, entire homes are the most popular type of high-priced listings.
Let's see what this looks like:
sns.countplot(data=listings2,x='room_type')
How do the room types relate to price?
listings2.groupby('room_type')['price'].describe()
fig, ax = plt.subplots(1,2)
sns.stripplot(data=listings2,x='room_type',y='price',alpha=0.5, ax=ax[0])
sns.violinplot(data=listings2,x='room_type',y='price', ax=ax[1])
This time, hotel rooms are the cheapest on average, while private rooms and shared rooms are the most expensive.
It is interesting that entire homes and apartments do not have the highest median or average price. This could be because the other types have significantly fewer listings (i.e. small sample sizes), which means we may not be able to confidently reach any conclusions.
columns = ['beds', 'bath_type', 'bath_count', 'bedrooms']
listings2[columns].describe()
High-priced listings seem to have higher values for these variables, compared to lower priced listings.
g = sns.PairGrid(df2, y_vars=["price"], x_vars=["bedrooms","beds", "bath_count"], height=4.5, hue="room_type", aspect=1.1)
ax = g.map(plt.scatter, alpha=0.5)
g.add_legend();
The scatter plots don't provide any clear takeaways about the distribution for high-priced listings. However, they do provide a glimpse of negative correlations with price. Is that really the case?
listings2.corr().sort_values('price', ascending=False)
Our guess was correct: the correlation these variables have with price is negative. This is very odd and does not follow our expectations.
listings2['accommodates'].describe()
listings2.groupby('room_type')['accommodates'].describe()
Looking at the breakdown of the variable, it seems that the average listing accommodates 6 people. Similar to before, entire homes/apartments tend to host more people (which was to be expected), but this time followed by shared rooms.
It may be due to the low sample size of 7 shared rooms, but it is odd that a shared room priced above $1,000 accommodates 5 people on average.
listings2[['accommodates','price']].corr()
a = sns.regplot(data=df2, y='price',x='accommodates')
a.set_xticks(range(0,20,2))
As described before, the correlation with price is negative. It is also not as strong as the correlation for low-priced listings, which may suggest that the number of people a high-priced listing accommodates carries little weight in its price.
listings2['amenities_count'].describe()
sns.violinplot(data=listings2,x='amenities_count')
The distribution of amenities count for high-priced listings seems more normally distributed, with most values centered between 10-35 amenities. The average seems to be about the same.
Does it have any correlation to price?
listings2[['amenities_count','price']].loc[listings2['price']<10000].corr()
sns.scatterplot(data=listings2,y='price',x='amenities_count',hue='room_type')
Based on the correlation matrix and scatter plot, the correlation between price and amenities count seems stronger for high-priced listings than for low-priced ones; however, it is negative and still quite weak, suggesting this may be more coincidence than anything else. The significantly smaller sample size further supports this caution.
This section will investigate the distribution of review scores of AirBnB listings and the number of reviews a given listing has, and how they relate to price. There are a number of review categories, but we will only look at the overall rating.
For an overview, the categories are:
Our initial hypothesis was that demand is a good determinant of price (based on widely accepted economic principles) and that a good way to capture demand was through a listing's reviews. We assumed that the higher the number of reviews or the higher the score, the more in-demand the listing, thus leading to higher prices.
Across the board (both high and low-priced listings), we found that this was not actually the case:
Ultimately, we have opted to exclude review-related variables from our models, with the exception of the overall review rating. For that feature, we will create two models, one with it and one without, and compare their performance.
number_of_reviews = df[['number_of_reviews','price']]
number_of_reviews.head()
#drop value of number_of_reviews == 0
data2 = number_of_reviews.drop(number_of_reviews[number_of_reviews['number_of_reviews'] ==0].index)
#price <1000
data2_0 = data2.drop(data2[data2['price']>1000].index)
data2_0_mean_low_price = data2_0.groupby('number_of_reviews').agg({'price':'mean'})
data2_0.plot(kind='scatter', x='number_of_reviews', y='price',alpha=0.3,color = '#86bf91')
plt.xlabel("Total Reviews",fontsize = 25)
plt.ylabel("Price",fontsize = 25)
plt.title('Number of Reviews & Price',fontsize = 30)
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)
plt.show()
data2_0_mean_low_price.plot(kind = 'hist',bins=15,color = '#86bf91')
plt.xlabel("Mean Price")
plt.ylabel("Frequency")
plt.title('Distribution of Mean Price by Review Count')
plt.show()
data2_0_0 = data2.drop(data2[data2['price']>1000].index)
data2_0_0 = data2_0_0.drop(data2_0_0[data2_0_0['price']<200].index)
data2_0_mean_1 = data2_0_0.groupby('number_of_reviews').agg({'price':'mean'})
data2_0_mean_1.plot(kind = 'hist',bins=15,color = '#86bf91')
plt.xlabel("Mean Price")
plt.ylabel("Frequency")
plt.title('Distribution of Mean Price by Review Count')
plt.show()
sns.regplot(x = 'number_of_reviews',y = 'price',
data = data2_0,
scatter=None, color='#86bf91')
plt.xlabel("Total Reviews")
plt.ylabel("Price")
plt.title('Number of Reviews & Price')
reviews_per_month = df[['reviews_per_month','price']]
data2_1 = reviews_per_month.drop(reviews_per_month[reviews_per_month['reviews_per_month'] ==0].index)
data2_1 = data2_1.dropna()
display(data2_1)
data2_1.describe()
data2_1_low_price = data2_1.drop(data2_1[data2_1['price']>1000].index)
data2_1_mean = data2_1_low_price.groupby('reviews_per_month').agg({'price':'mean'})
display(data2_1_low_price)
data2_1_mean.plot(kind = 'hist',bins=40, alpha=0.7,color = '#86bf91')
plt.xlabel("Mean Price")
plt.ylabel("Frequency")
plt.title('Distribution of Mean Price by Reviews Per Month')
plt.show()
cols = ['price','reviews_per_month', 'number_of_reviews']
df1[cols].corr()['price'].to_frame()
There seems to be little to no correlation between the review count (total or monthly) and the listing's price.
With multiple score categories painting a picture of the quality of the AirBnB, there may be multicollinearity between these scores, which could lead to overfitting if we decide to use all of them.
cols = ['price','review_scores_rating', 'review_scores_accuracy',
'review_scores_cleanliness', 'review_scores_checkin',
'review_scores_communication', 'review_scores_location',
'review_scores_value']
corr = abs(df1[cols].corr())
f, ax = plt.subplots(figsize=(11, 9))
# mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, cmap='Blues')
Looking at the correlation matrix, there seems to be a strong correlation between price and the location reviews, cleanliness reviews, and overall rating. Additionally, there is collinearity between most of the categories and the overall score, meaning we will use only one of these metrics to avoid overfitting.
Location reviews is the only category with low collinearity with the overall score, but since location reviews relate to the location of the listing rather than its quality, we will disregard it for this section of the study.
Ultimately, between these categories, we believe using the overall rating feature (review_scores_rating) would be the best way to take reviews into account in our predictive model. Note that the variable has missing values; they will be imputed in the next section of our notebook.
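One simple option for that imputation step is filling missing scores with the median; this is an illustrative assumption, since the exact method is not specified here.

```python
import numpy as np
import pandas as pd

# Toy review scores with gaps
scores = pd.Series([95.0, np.nan, 88.0, 92.0, np.nan])

# Median imputation: robust to the skew typical of review scores
filled = scores.fillna(scores.median())
print(filled.isna().sum(), filled.median())
```

Median imputation preserves the central tendency but shrinks the variance, which is worth keeping in mind when many values are missing.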
cols = ['price','review_scores_rating']
df1[cols].corr()[cols]
sns.scatterplot(data=df1,y='price',x='review_scores_rating')
Oddly enough, there does not seem to be much correlation between these two variables, which makes us consider whether or not reviews are even worth considering in our model.
Instead of leaving it out, we will opt to create two models, one with review scores and one without, and evaluate the significance of reviews based on which model is better.
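The with/without-reviews comparison could be sketched as follows, using plain numpy least squares as a stand-in for whatever model the project ultimately fits. The data is synthetic, with the rating deliberately carrying no signal, and the `rmse` helper is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
accommodates = rng.integers(1, 9, n).astype(float)
rating = rng.uniform(80, 100, n)
price = 30 * accommodates + rng.normal(0, 5, n)  # rating carries no signal here

def rmse(X, y):
    """Fit OLS with an intercept and return in-sample RMSE."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))

with_reviews = rmse(np.column_stack([accommodates, rating]), price)
without_reviews = rmse(accommodates.reshape(-1, 1), price)

# Nearly identical errors suggest the rating added little
print(abs(with_reviews - without_reviews) < 1.0)
```

In practice the comparison should use held-out data, since in-sample error can only decrease when a feature is added.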
data2_0_1_high_price = data2.drop(data2[data2['price']<1000].index)
display(data2_0_1_high_price.describe())
display(data2.describe())
data2_0_1_high_price.plot(kind='scatter', x='number_of_reviews', y='price',alpha=0.8,color = '#86bf91')
plt.xlabel("Total Reviews",fontsize = 20)
plt.ylabel("Price",fontsize = 20)
plt.title('Number of Reviews & Price',fontsize = 30)
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)
plt.show()
data2_0_mean_high_price = data2_0_1_high_price.groupby('number_of_reviews').agg({'price':'mean'})
data2_0_mean_high_price.plot(kind = 'hist',bins=15, alpha=0.7,color = '#86bf91')
plt.xlabel("Mean Price")
plt.ylabel("Frequency")
plt.title('Distribution of Mean Price by Review Count')
plt.xticks(fontsize = 10)
plt.yticks(fontsize = 10)
plt.show()
sns.regplot(x = 'price',y = 'number_of_reviews',
data = data2_0_1_high_price,
scatter=None, color='#86bf91')
plt.xlabel("Price")
plt.ylabel("Total Reviews")
plt.title('Number of Reviews & Price')
data2_1_high = data2_1.drop(data2_1[data2_1['price']<1000].index)
data2_1_high.describe()
data2_1_high.plot(kind='scatter', x='reviews_per_month', y='price',alpha=0.7,color = '#86bf91')
plt.xlabel("Reviews Per Month")
plt.ylabel("Price")
plt.title('Reviews Per Month & Price')
plt.show()
data2_1_high_mean = data2_1_high.groupby('reviews_per_month').agg({'price':'mean'})
data2_1_high_mean.plot(kind = 'hist',bins=15, alpha=0.7,color = '#86bf91')
plt.xlabel("Mean Price")
plt.ylabel("Frequency")
plt.title('Distribution of Mean Price by Reviews Per Month')
plt.show()
cols = ['price','reviews_per_month', 'number_of_reviews']
df2[cols].corr()['price'].to_frame()
The correlations between these variables and price remain low (albeit higher than for low-priced listings).
With multiple score categories painting a picture of the quality of the AirBnB, there may be multicollinearity between these scores, which could lead to overfitting if we decide to use all of them.
cols = ['price','review_scores_rating', 'review_scores_accuracy',
'review_scores_cleanliness', 'review_scores_checkin',
'review_scores_communication', 'review_scores_location',
'review_scores_value']
corr = abs(df2[cols].corr())
f, ax = plt.subplots(figsize=(11, 9))
# mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, cmap='Blues')
Looking at the correlation matrix, the same insights regarding collinearity for low-priced listings can be drawn for high-priced listings.
cols = ['price','review_scores_rating']
df2[cols].corr()[cols]
sns.scatterplot(data=df2,y='price',x='review_scores_rating')
Again, there does not seem to be much correlation between these two variables. Similar to our low-priced model, we will create two models for high-priced listings (one with review scores and one without).
This section will explore variables related to the listing's accessibility and availability, looking at variables like:
The findings for both types of listings (high and low-priced) yield similar conclusions:
Areas of Improvement:
df.loc[df['availability_365']==0]['availability_365'].count()
With over 21,000 listings showing 0 available days in a year, we have decided to drop these rows, as they would significantly distort the relationship between price and availability.
# Make a duplicate copy of the data frame so we can add new columns for the availability section
data_avail1 = df1[df1.availability_365 != 0]
data_avail1.loc[(data_avail1['instant_bookable']=='f')]['instant_bookable'].count()
data_avail1.loc[(data_avail1['instant_bookable']=='t')]['instant_bookable'].count()
sns.countplot(data = data_avail1, x='instant_bookable')
data_avail1['instant_bookable'] = data_avail1['instant_bookable'].replace(['t','f'],[1,0])
data_avail1['instant_bookable'].mean()
Most listings are not instantly bookable; only 36% of the listings are.
data_avail1.groupby('instant_bookable')['price'].describe()
sns.catplot(x="instant_bookable",
y="price",
data=data_avail1, kind='box')
plt.show()
table = data_avail1.groupby('instant_bookable')['price'].describe().reset_index()
sns.barplot(data=table, x='instant_bookable',y='mean')
Instantly bookable rooms have higher prices on average but the difference is not very significant.
data_avail1[['price','instant_bookable']].corr()
It seems that there is no significant correlation between these two variables.
data_avail1['avail_30_rate'] = data_avail1['availability_30']/30
data_avail1['avail_60_rate'] = data_avail1['availability_60']/60
data_avail1['avail_90_rate'] = data_avail1['availability_90']/90
data_avail1['avail_365_rate'] = data_avail1['availability_365']/365
avail_info = data_avail1.describe()[['avail_30_rate', 'avail_60_rate', 'avail_90_rate', 'avail_365_rate']]
avail_info
Looking across the 'mean' row of the availability-rate summary table above, we can see that the average availability rate is around 30% regardless of the time window: 30 days, 60 days, 90 days, or a full year.
hist = data_avail1.hist(column=['availability_30', 'availability_60', 'availability_90', 'availability_365'], figsize=(14,6))
plt.show()
Looking at the histogram for availability_365 on the bottom right, availability clusters at two extremes: many listings are available close to 350 days a year, while many others fall in the range of 1 to 110 days.
sns.violinplot(x='borough',
y='availability_365', data=data_avail1)
plt.show()
sns.catplot(x="borough",
y="availability_365",
data=data_avail1, kind='bar', ci=None)
plt.ylabel('Days available in a Year (365 Days)')
plt.xlabel('Boroughs in NYC')
plt.show()
# relationship between availability in a year and the price
sns.jointplot(x= 'availability_365',
y= 'price',
data=data_avail1)
plt.show()
data_avail1[['price','availability_365']].corr()
fig, ax = plt.subplots(1,2)
sns.regplot(data=data_avail1,y='price',x='availability_365',scatter=False, ax=ax[0])
sns.regplot(data=data_avail1,y='price',x='availability_365',scatter=True, ax=ax[1])
From the linear regression and correlation matrix, we can see that there is no relationship between price and days available in a year.
Thus, when estimating a price, availability should not be considered as a variable for low-priced listings.
min_night_table = data_avail1[data_avail1.minimum_nights <365]
sns.relplot(x="minimum_nights",
y="price",
data=min_night_table,
kind="scatter", col='instant_bookable', hue='borough')
plt.show()
min_night_table[['price','minimum_nights']].corr()
For minimum nights vs. price, a slight visual trend appears: the higher the minimum number of booking nights, the lower the price.
One hypothesis is that longer-term rentals on Airbnb carry a lower price per night than shorter-term ones. However, when we calculate the correlation, there is little connection between minimum_nights and price. Thus, minimum_nights should not be considered an effective variable for predicting price.
max_night_table = data_avail1[data_avail1.maximum_nights <365]
sns.relplot(x="maximum_nights",
y="price",
data=max_night_table,
kind="scatter", col='instant_bookable', hue='borough')
max_night_table[['price','maximum_nights']].corr()
There is no significant trend present here.
# Make a duplicate copy of the data frame so we can add new columns for the availability section
data_avail2 = df2[df2.availability_365 != 0]
sns.countplot(data = data_avail2, x='instant_bookable')
data_avail2['instant_bookable'] = data_avail2['instant_bookable'].replace(['t','f'],[1,0])
data_avail2['instant_bookable'].mean()
Most listings are not instantly bookable; only 44% of the listings are. However, this proportion is higher than for low-priced listings.
Let's see if there is a relationship between price and instant bookability.
data_avail2.groupby('instant_bookable')['price'].describe()
sns.catplot(x="instant_bookable",
y="price",
data=data_avail2, kind='box')
plt.show()
table = data_avail2.groupby('instant_bookable')['price'].describe().reset_index()
sns.barplot(data=table, x='instant_bookable',y='mean')
Instantly bookable rooms have higher prices on average but the difference is not very significant.
data_avail2[['price','instant_bookable']].corr()
It seems that there is no significant correlation between these two variables.
data_avail2['avail_30_rate'] = data_avail2['availability_30']/30
data_avail2['avail_60_rate'] = data_avail2['availability_60']/60
data_avail2['avail_90_rate'] = data_avail2['availability_90']/90
data_avail2['avail_365_rate'] = data_avail2['availability_365']/365
# relationship between availability in a year and the price
sns.jointplot(x= 'availability_365',
y= 'price',
data=data_avail2)
plt.show()
data_avail2[['price','availability_365']].corr()
fig, ax = plt.subplots(1,2)
sns.regplot(data=data_avail2,y='price',x='availability_365',scatter=False, ax=ax[0])
sns.regplot(data=data_avail2,y='price',x='availability_365',scatter=True, ax=ax[1])
From the linear regression and correlation matrix, we can see that there is no relationship between price and days available in a year.
Similar to low-priced listings, availability should not be considered as a variable for high-priced listings.
min_night_table = data_avail2[data_avail2.minimum_nights <365]
sns.relplot(x="minimum_nights",
y="price",
data=min_night_table,
kind="scatter", hue='borough')
plt.show()
For minimum nights vs. price, a similar slight trend appears as before, though with fewer data points.
max_night_table = data_avail2[data_avail2.maximum_nights <365]
sns.relplot(x="maximum_nights",
y="price",
data=max_night_table,
kind="scatter", col='instant_bookable', hue='borough')
There is no significant trend present here.
Based on our EDA, we determined that out of the initial 70+ features, 6 hold predictive value once multicollinearity among certain features is taken into account (5 features for high-priced listings). They are the following:
- accommodates - the number of people the listing accommodates
- room_type - the type of listing (entire home/apt vs. private room vs. shared room vs. hotel room)
- minimum_nights - the minimum number of nights renters are required to stay
- stations_per_capita_in_boro - the number of stations per capita in the listing's borough
- property_price_per_sqft - the average property sale price per sqft in the listing's borough (no predictive value for high-priced listings)
- review_scores_rating - overall review scores (low correlation in our EDA, but we will test the predictive value in our ML models)

After conducting EDA and feature engineering in the previous sections, we have identified columns that have no predictive value for the price variable, as well as features with high collinearity with other features; we have decided to drop these features to prevent overfitting.
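One way to make the multicollinearity check explicit is with variance inflation factors (VIF). A hedged sketch on synthetic data (the column names here are illustrative, not from our dataset):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
# Synthetic numeric features; x2 is nearly a copy of x1 (highly collinear)
df_num = pd.DataFrame({
    'x1': rng.normal(size=300),
    'x3': rng.normal(size=300),
})
df_num['x2'] = df_num['x1'] + rng.normal(scale=0.01, size=300)

# VIF for each column: values well above ~10 flag problematic collinearity
vif = pd.Series(
    [variance_inflation_factor(df_num.values, i) for i in range(df_num.shape[1])],
    index=df_num.columns)
print(vif)
```

Applied to our merged dataset, columns like `availability_30`/`availability_60`/`availability_90` would be expected to show very large VIFs against each other, supporting the decision to drop them.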
# cd ..
df = pd.read_csv('merged_dataset.csv')
drop = ['id','name','description','host_id','neighbourhood', 'borough', 'latitude','longitude','maximum_nights', 'minimum_minimum_nights','maximum_minimum_nights', 'minimum_maximum_nights','maximum_maximum_nights', 'minimum_nights_avg_ntm','maximum_nights_avg_ntm','availability_30','availability_60', 'availability_90','availability_365','number_of_reviews_ltm', 'number_of_reviews_l30d','bathrooms_text','amenities','property_type','host_response_rate', 'host_acceptance_rate','bedrooms','beds','bath_count','bath_type','review_scores_accuracy', 'review_scores_cleanliness',
'review_scores_checkin', 'review_scores_communication','review_scores_location', 'review_scores_value', 'reviews_per_month','host_total_listings_count','host_is_superhost','host_has_profile_pic','host_identity_verified','instant_bookable','property_price','Station_count','stations_per_sq-mi_in_boro','has_availability','amenities_count','number_of_reviews']
df = df.drop(columns=drop)
df.info()
Now that we have dropped the columns, we need to dummify the categorical variables and fill any null values with an appropriate method.
df_dum = pd.get_dummies(df, drop_first=True)
df_dum.head(3)
After dropping irrelevant columns, we still have one column with missing values. Let's impute the null values:
review_scores_rating
Imputing this variable is a little complicated. Replacing nulls with 0 is not appropriate, since that assumes the listing is bad, but replacing them with the mean or median may not be wise either, as it oversimplifies the nuances that go into a review. Therefore, we will use a simple linear regression model to impute the review values.
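As an aside, scikit-learn formalizes this regression-based approach in `IterativeImputer`, which regresses each column with missing values on the others. A minimal sketch on a toy matrix (the values are illustrative, not from our dataset):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix: (accommodates, price, review score), last row missing a score
X = np.array([
    [2.0, 100.0, 95.0],
    [4.0, 180.0, 98.0],
    [1.0,  60.0, 90.0],
    [3.0, 150.0, np.nan],  # value to impute
])

# By default each column with missing values is regressed on the
# others (BayesianRidge), iterating until the imputations converge
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled[3, 2])  # the imputed review score
```

We implement the same idea manually below so the intermediate model can be inspected and compared against mean/median imputation.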
df_review = df_dum.dropna(axis=0).drop(columns='price')
df_review.info()
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X = df_review.drop('review_scores_rating', axis=1)
y = df_review['review_scores_rating']
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=780)
print("Xtrain shape:", Xtrain.shape)
print("Xtest shape:", Xtest.shape)
model.fit(Xtrain, ytrain)
y_model = model.predict(Xtest)
test = Xtest.join(ytest).reset_index()
test.join(pd.Series(y_model, name='predicted reviews')).head()
table = test.join(pd.Series(y_model, name='predicted reviews'))
sns.scatterplot(x='review_scores_rating',y='predicted reviews', data=table)
from sklearn.metrics import mean_squared_error
import math
RMSE = round(math.sqrt(mean_squared_error(ytest, y_model)),2)
print(f'This predictive model has a RMSE of {RMSE}.')
Looking at the scatter plot, it seems our model has done an adequate job predicting the review scores. It only predicted values above ~75, but that is likely because most of the data points we trained on had high review scores, leading the model to predict higher scores.
Let's see how well this prediction compares to using the mean and median to impute scores:
mean = [df.review_scores_rating.mean() for _ in range(ytest.shape[0])]
median = [df.review_scores_rating.median() for _ in range(ytest.shape[0])]
RMSE = round(math.sqrt(mean_squared_error(ytest, pd.Series(mean))),2)
print(f'Imputing values with the mean will result in a RMSE of {RMSE}.')
RMSE = round(math.sqrt(mean_squared_error(ytest, pd.Series(median))),2)
print(f'Imputing values with the median will result in a RMSE of {RMSE}.')
Across the board, our model has lower RMSE scores than using the mean and median.
This means that, when taking the size of the errors into account, using the model to impute scores will likely be a better method than using the mean or median review scores.
Note that further improvements on the model can be made with deeper EDA and further feature engineering, especially since we are imputing values only with price-correlated features, but with the limited time and scope of the project, we will not be doing this.
Now we will impute any missing values in the review_scores_rating variable with the predicted review score:
df_no_reviews = df_dum.drop(columns=['review_scores_rating','price'])
df_dum['predicted_reviews'] = model.predict(df_no_reviews)
df_dum.info()
# Fill missing review scores with the model's predictions
df_dum['review_scores_rating'] = df_dum['review_scores_rating'].fillna(df_dum['predicted_reviews'])
df_dum.info()
We have successfully imputed the missing values! Now we can get rid of the predicted_reviews column.
df_dum.drop(columns='predicted_reviews',inplace=True)
df = df_dum
df.info()
price_filter_above1000 = df['price'] > 1000
df1 = df[~price_filter_above1000]
df2 = df[price_filter_above1000]
df1
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X = df1.drop('price', axis=1)
y = df1['price']
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=780)
print("Xtrain shape:", Xtrain.shape)
print("Xtest shape:", Xtest.shape)
model.fit(Xtrain, ytrain)
y_model = model.predict(Xtest)
data = {'coefficient':model.coef_,'variable':Xtest.columns}
pd.DataFrame(data=data)
test = Xtest.join(ytest).reset_index()
test.join(pd.Series(y_model, name='predicted')).head()
table = test.join(pd.Series(y_model, name='predicted'))
sns.scatterplot(x='price',y='predicted', data=table)
from sklearn.metrics import mean_squared_error
import math
RMSE = round(math.sqrt(mean_squared_error(ytest, y_model)),2)
print(f'This predictive model has a RMSE of {RMSE}.')
X = df1.drop(['price','review_scores_rating'], axis=1)
y = df1['price']
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=780)
print("Xtrain shape:", Xtrain.shape)
print("Xtest shape:", Xtest.shape)
model.fit(Xtrain, ytrain)
y_model = model.predict(Xtest)
data = {'coefficient':model.coef_,'variable':Xtest.columns}
pd.DataFrame(data=data)
test = Xtest.join(ytest).reset_index()
test.join(pd.Series(y_model, name='predicted')).head()
table = test.join(pd.Series(y_model, name='predicted'))
sns.scatterplot(x='price',y='predicted', data=table)
from sklearn.metrics import mean_squared_error
import math
RMSE = round(math.sqrt(mean_squared_error(ytest, y_model)),2)
print(f'This predictive model has a RMSE of {RMSE}.')
Our linear regression model resulted in an RMSE score of ~96. It seems that including review scores leads to a small improvement to the score but is arguably negligible, meaning that review scores may not be an important determinant of price.
Based on the coefficients, it seems the factors that best explain price are the room type and the number of people the listing accommodates.
Additionally, both models seem to be predicting negative values, which is clearly an error.
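One common remedy for negative price predictions (not applied in this notebook) is to fit the regression on log-transformed prices and exponentiate the predictions, which guarantees positive values. A hedged sketch on synthetic data standing in for our features and prices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic stand-ins for listing features and strictly positive prices
X = rng.normal(size=(200, 3))
price = np.exp(4 + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.1, size=200))

# Fit on log(price) instead of the raw price...
model = LinearRegression().fit(X, np.log(price))

# ...and exponentiate the predictions: exp() is always positive
pred = np.exp(model.predict(X))
print(pred.min())  # strictly greater than zero by construction
```

The trade-off is that coefficients then describe multiplicative (percentage) effects on price rather than dollar effects.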
df2
model = LinearRegression()
X = df2.drop(['price','stations_per_capita_in_boro'], axis=1)
y = df2['price']
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=780)
print("Xtrain shape:", Xtrain.shape)
print("Xtest shape:", Xtest.shape)
model.fit(Xtrain, ytrain)
y_model = model.predict(Xtest)
data = {'coefficient':model.coef_,'variable':Xtest.columns}
pd.DataFrame(data=data)
test = Xtest.join(ytest).reset_index()
test.join(pd.Series(y_model, name='predicted')).head()
table = test.join(pd.Series(y_model, name='predicted'))
sns.scatterplot(x='price',y='predicted', data=table)
from sklearn.metrics import mean_squared_error
import math
RMSE = round(math.sqrt(mean_squared_error(ytest, y_model)),2)
print(f'This predictive model has a RMSE of {RMSE}.')
X = df2.drop(['price','review_scores_rating','stations_per_capita_in_boro'], axis=1)
y = df2['price']
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=780)
print("Xtrain shape:", Xtrain.shape)
print("Xtest shape:", Xtest.shape)
model.fit(Xtrain, ytrain)
y_model = model.predict(Xtest)
data = {'coefficient':model.coef_,'variable':Xtest.columns}
pd.DataFrame(data=data)
test = Xtest.join(ytest).reset_index()
test.join(pd.Series(y_model, name='predicted')).head()
table = test.join(pd.Series(y_model, name='predicted'))
sns.scatterplot(x='price',y='predicted', data=table)
from sklearn.metrics import mean_squared_error
import math
RMSE = round(math.sqrt(mean_squared_error(ytest, y_model)),2)
print(f'This predictive model has a RMSE of {RMSE}.')
Our linear regression model resulted in an RMSE score of ~2960. This large increase between the two models can potentially be explained by the following reasons:
Based on the coefficients, it seems the factors that best explain price for high-priced listings are still the room type and the number of people the listing accommodates; however, the latter has a negative coefficient, which goes against our logic and hypothesis (i.e. the more people a listing accommodates, the lower the price).
Let's see if a deep learning model can improve upon our predictions:
!pip install tensorflow
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
X = df1.drop('price', axis=1)
y = df1['price']
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=780)
print("Xtrain shape:", Xtrain.shape)
print("Xtest shape:", Xtest.shape)
dnn_model = keras.Sequential([
layers.Dense(64, activation='relu', input_shape=[Xtrain.shape[-1],]),
layers.Dropout(0.2),
layers.Dense(64, activation='relu'),
layers.Dropout(0.2),
layers.Dense(1)
])
dnn_model.compile(loss='mean_absolute_error',
optimizer=tf.keras.optimizers.Adam(0.001))
history = dnn_model.fit(
Xtrain, ytrain,  # fit on the training split only, so Xtest stays unseen
epochs=100,
verbose=0, # suppress logging
validation_split = 0.3) # Calculate validation results on 30% of the training data
test_results = {}
RMSEs = {}
test_results['dnn_model'] = dnn_model.predict(Xtest, verbose=0)
RMSEs['dnn_model'] = round(mean_squared_error(ytest, test_results['dnn_model'], squared=False), 0)
RMSEs['dnn_model']
ax = plt.scatter(ytest,
test_results['dnn_model'].reshape(test_results['dnn_model'].shape[0],),
edgecolors='white')
plt.xlabel('Actual Listing Price')
plt.ylabel('Predicted Listing Price')
plt.title('Deep Neural Network')
lims = [ytest.min(), ytest.max()]
plt.plot(lims, lims, linewidth=1, c='red', linestyle='--');  # y = x reference line
X = df2.drop(['price','stations_per_capita_in_boro'], axis=1)
y = df2['price']
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=780)
print("Xtrain shape:", Xtrain.shape)
print("Xtest shape:", Xtest.shape)
dnn_model = keras.Sequential([
layers.Dense(64, activation='relu', input_shape=[Xtrain.shape[-1],]),
layers.Dropout(0.2),
layers.Dense(64, activation='relu'),
layers.Dropout(0.2),
layers.Dense(1)
])
dnn_model.compile(loss='mean_absolute_error',
optimizer=tf.keras.optimizers.Adam(0.001))
history = dnn_model.fit(
Xtrain, ytrain,  # fit on the training split only, so Xtest stays unseen
epochs=1000,
verbose=0, # suppress logging
validation_split = 0.3) # Calculate validation results on 30% of the training data
test_results = {}
RMSEs = {}
test_results['dnn_model'] = dnn_model.predict(Xtest, verbose=0)
RMSEs['dnn_model'] = round(mean_squared_error(ytest, test_results['dnn_model'], squared=False), 0)
RMSEs['dnn_model']
ax = plt.scatter(ytest,
test_results['dnn_model'].reshape(test_results['dnn_model'].shape[0],),
edgecolors='white')
plt.xlabel('Actual Listing Price')
plt.ylabel('Predicted Listing Price')
plt.title('Deep Neural Network')
lims = [ytest.min(), ytest.max()]
plt.plot(lims, lims, linewidth=1, c='red', linestyle='--');  # y = x reference line
It seems that running a DNN model to predict prices leads to a slightly higher RMSE (though not by a significant margin); however, it did solve the negative-price predictions our linear regression model produced.
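One likely reason the DNN does not outperform the linear model is that our inputs are unscaled; standardizing features typically helps gradient-based training converge. A minimal sketch using synthetic data in place of our feature matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for features on wildly different scales,
# like accommodates (1-10) vs. property_price_per_sqft (hundreds)
X = np.column_stack([rng.integers(1, 10, 500).astype(float),
                     rng.normal(1000, 300, 500)])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean ~0 and unit variance, so no single
# feature dominates the gradient updates
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

In practice the scaler would be fit on `Xtrain` only and then applied to `Xtest` to avoid leaking test-set statistics into training.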
In this iteration of our project, our team was able to obtain a more robust dataset that made our work both easier and harder. With a plethora of features to potentially use in our model, we had to conduct extensive EDA to determine which features had any predictive value. To our surprise, after excluding variables with high multicollinearity, very few features (5-6) were ultimately used in our models, because many of the 70+ features seemingly had little to no predictive value.
Deciding to split the dataset into low-priced and high-priced listings essentially doubled our EDA work, but we thought it was worth it in order to account for price outliers and to be able to predict prices for both price categories. Ultimately, our model for low-priced listings was decent, while the model for high-priced listings did not perform as well.
Based on our models' coefficients, we discovered that the factors that most heavily determine price are the listing's room type and the number of people it accommodates. Other factors, like review scores and average property sale prices in the listing's borough, had some sway, but not much. It may be the case that some features' impact was under- or overstated in our model and EDA, but given the scope and resource limitations of the project, we were unable to explore this possibility further.
General note